Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-37093: [Python] Add async Flight client with GetFlightInfo #36986

Merged
merged 7 commits into from
Aug 23, 2023

Conversation

lidavidm
Copy link
Member

@lidavidm lidavidm commented Aug 1, 2023

Rationale for this change

Demonstrate how to deal with async Flight from Python.

What changes are included in this PR?

Add bindings for GetFlightInfo.

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes, new APIs.

@github-actions

This comment was marked as resolved.

@lidavidm lidavidm changed the title WIP: [Python] Add async Flight client with GetFlightInfo GH-37093: [Python] Add async Flight client with GetFlightInfo Aug 9, 2023
@lidavidm lidavidm force-pushed the flight-async-py-client-getflightinfo branch from 25fa26f to 1bf01ae Compare August 9, 2023 15:04
@lidavidm
Copy link
Member Author

lidavidm commented Aug 9, 2023

CC @pitrou @jorisvandenbossche this is a skeleton of how to bind async C++ code to asyncio in Python, though I still need to get it working with both asyncio and Trio. (I'm looking at https://github.com/codypiersall/pynng/blob/master/pynng/_aio.py as a reference of how we might do that.)

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a good start to me.

python/pyarrow/_flight.pyx Outdated Show resolved Hide resolved
python/pyarrow/_flight.pyx Outdated Show resolved Hide resolved
python/pyarrow/_flight.pyx Outdated Show resolved Hide resolved
@@ -1320,6 +1399,11 @@ cdef class FlightClient(_Weakrefable):
check_flight_status(CFlightClient.Connect(c_location, c_options
).Value(&self.client))

def as_async(self) -> None:
if anyio is None:
raise RuntimeError("anyio is required for the async interface")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why we would want anyio. Just start with plain asyncio?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main thing is to support both asyncio and Trio users (though I think anyio is unnecessary and we can just directly support them both)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would we want that, though? asyncio is the de facto standard. And the anyio API for delegating stuff to external threads seems rather convoluted.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a reason to just lock out a part of the ecosystem.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And FWIW, one of the very first issues opened against ADBC was request for async support via anyio specifically, to support both Trio and asyncio

python/pyarrow/src/arrow/python/flight.cc Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Aug 9, 2023
@lidavidm lidavidm force-pushed the flight-async-py-client-getflightinfo branch from 1bf01ae to a89fe71 Compare August 9, 2023 19:10
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Aug 9, 2023
@lidavidm
Copy link
Member Author

lidavidm commented Aug 9, 2023

Updated, added basic tests. I'll deal with Trio in a later PR, as well as raising an RpcError-type exception instead of built in exceptions (again: RPC errors should be distinct, and should be handled uniformly), once I've reviewed the C++ side and decided I'm happy with TransportStatusDetail.

There's more we can do with e.g. pytest-asyncio and such (down the line, we'd want all tests to be parametrized on asyncio vs Trio) that I don't think is worth tackling immediately.

@lidavidm lidavidm marked this pull request as ready for review August 9, 2023 19:11
python/pyarrow/_flight.pyx Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Aug 9, 2023
@lidavidm lidavidm force-pushed the flight-async-py-client-getflightinfo branch from a89fe71 to 68f2aa0 Compare August 9, 2023 19:25
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Aug 9, 2023
@lidavidm
Copy link
Member Author

Hmm, that's an interesting segfault in CI.

@lidavidm
Copy link
Member Author

Alright:

  • The weird SystemError: <class 'SystemError'> returned a result with an error set is inscrutable to me. It occurs along with a TypeError: __traceback__ must be a traceback or None message. with_traceback doesn't appear to be able to clear the problematic traceback and I can't get gdb to break on the source of the error properly.

  • The segfault is also related to tracebacks, somehow; it reproduces with PYTHONDEVMODE and gdb shows it happens somewhere in tb_traverse, where obj->ob_type points to something that appears to have already been deallocated:

    #0  _PyObject_IS_GC (obj=0x7fff656ce7af) at /usr/local/src/conda/python-3.11.3/Include/object.h:772
    #1  visit_decref (op=0x7fff656ce7af, parent=parent@entry=0x7fff656ce710)
        at /usr/local/src/conda/python-3.11.3/Modules/gcmodule.c:452
    #2  0x000055555583df5b in tb_traverse (tb=0x7fff656ce710, visit=0x555555723a20 <visit_decref>, arg=0x7fff656ce710)
        at /usr/local/src/conda/python-3.11.3/Python/traceback.c:182
    

@lidavidm
Copy link
Member Author

So, I don't understand why, but bouncing through a pure-Python thread fixes the segfault. It seems something about having the C-callback in the traceback at some point borks the interpreter. It does seem that the asyncio implementation tries to get the traceback unconditionally (in call_soon_threadsafe -> call_soon -> asyncio.events.Handle) so I suppose maybe Cython isn't giving a valid traceback object for _client_async_get_flight_info? (Should probably go confirm that.)

No clue about the first error...

@pitrou
Copy link
Member

pitrou commented Aug 10, 2023

Can the segfault be reproduced locally?

@lidavidm
Copy link
Member Author

Can the segfault be reproduced locally?

Yes, just run the test with PYTHONDEVMODE. The other error only seems to reproduce under Docker though I haven't tried very hard so far.

@pitrou
Copy link
Member

pitrou commented Aug 10, 2023

Ok, I'll take a look hopefully on Monday.

@lidavidm
Copy link
Member Author

lidavidm commented Aug 10, 2023

Looking at these:

It looks like Cython indeed doesn't generate stack frames for calls, and also modifying wakeup here to try to call sys._getframe like asyncio.event.Handle does crashes with call stack is not deep enough. So indeed, it seems we should avoid touching asyncio or generating exceptions from the native callback. Bouncing through a Python thread seems unavoidable from that perspective, unfortunately. (asyncio tries to generate stacktraces to support async debugging.)

@pitrou pitrou force-pushed the flight-async-py-client-getflightinfo branch from 0cc328e to 4154e4f Compare August 23, 2023 15:03
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Aug 23, 2023
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot. This really opens interesting possibilities, I think.

@pitrou

This comment was marked as outdated.

@github-actions

This comment was marked as outdated.

Comment on lines +1263 to +1271
async def get_flight_info(
self,
descriptor: FlightDescriptor,
*,
options: FlightCallOptions = None,
):
call = AsyncioCall()
self._get_flight_info(call, descriptor, options)
return await call.as_awaitable()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could just be:

Suggested change
async def get_flight_info(
self,
descriptor: FlightDescriptor,
*,
options: FlightCallOptions = None,
):
call = AsyncioCall()
self._get_flight_info(call, descriptor, options)
return await call.as_awaitable()
def get_flight_info(
self,
descriptor: FlightDescriptor,
*,
options: FlightCallOptions = None,
):
call = AsyncioCall()
self._get_flight_info(call, descriptor, options)
return call.as_awaitable()

but I've kept the explicit async def for introspection and self-documentation... what do you think @jorisvandenbossche ?

@pitrou
Copy link
Member

pitrou commented Aug 23, 2023

At least test-debian-11-python-3 has failed with:
https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=53319&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=5448

usr/local/lib/python3.9/dist-packages/pyarrow/tests/test_flight_async.py:59: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
usr/lib/python3.9/asyncio/runners.py:44: in run
    return loop.run_until_complete(main)
usr/lib/python3.9/asyncio/base_events.py:642: in run_until_complete
    return future.result()
usr/local/lib/python3.9/dist-packages/pyarrow/tests/test_flight_async.py:56: in _test
    info = await async_client.get_flight_info(descriptor)
pyarrow/_flight.pyx:1281: in get_flight_info
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   pyarrow.lib.ArrowNotImplementedError: gRPC 1.40 or newer is required to use async

We could perhaps expose supports_async() in Python.

@pitrou
Copy link
Member

pitrou commented Aug 23, 2023

That said, it would perhaps be nicer to expose the same error message in Python as in C++...

@pitrou
Copy link
Member

pitrou commented Aug 23, 2023

@github-actions crossbow submit -g python

/// This is like supports_async(), except that a detailed error message
/// is returned if async support is not available. If async support is
/// available, this function returns successfully.
Status CheckAsyncSupport() const;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lidavidm Does this API look ok?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, though I would be happy to replace supports_async entirely with this.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Aug 23, 2023
@github-actions
Copy link

Revision: 58a938a

Submitted crossbow builds: ursacomputing/crossbow @ actions-b4276e2530

Task Status
test-conda-python-3.10 Github Actions
test-conda-python-3.10-hdfs-2.9.2 Github Actions
test-conda-python-3.10-hdfs-3.2.1 Github Actions
test-conda-python-3.10-pandas-latest Github Actions
test-conda-python-3.10-pandas-nightly Github Actions
test-conda-python-3.10-spark-v3.4.1 Github Actions
test-conda-python-3.10-substrait Github Actions
test-conda-python-3.11 Github Actions
test-conda-python-3.11-dask-latest Github Actions
test-conda-python-3.11-dask-upstream_devel Github Actions
test-conda-python-3.11-hypothesis Github Actions
test-conda-python-3.11-pandas-upstream_devel Github Actions
test-conda-python-3.11-spark-master Github Actions
test-conda-python-3.8 Github Actions
test-conda-python-3.8-pandas-1.0 Github Actions
test-conda-python-3.8-spark-v3.4.1 Github Actions
test-conda-python-3.9 Github Actions
test-conda-python-3.9-pandas-latest Github Actions
test-cuda-python Github Actions
test-debian-11-python-3 Azure
test-fedora-35-python-3 Azure
test-ubuntu-20.04-python-3 Azure
test-ubuntu-22.04-python-3 Github Actions

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Aug 23, 2023
@pitrou
Copy link
Member

pitrou commented Aug 23, 2023

CI failures are unrelated.

@pitrou pitrou merged commit b6ab909 into apache:main Aug 23, 2023
32 of 33 checks passed
@pitrou pitrou removed the awaiting change review Awaiting change review label Aug 23, 2023
@lidavidm lidavidm deleted the flight-async-py-client-getflightinfo branch August 23, 2023 17:38
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit b6ab909.

There were 4 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…pache#36986)

### Rationale for this change

Demonstrate how to deal with async Flight from Python.

### What changes are included in this PR?

Add bindings for GetFlightInfo.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes, new APIs.

* Closes: apache#37093

Lead-authored-by: David Li <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FlightRPC][Python] Implement async GetFlightInfo
2 participants